Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Mapping of Sequence Reads to the Reference Genomes ◾ 51

found including gene annotation in GFF/GTF file format, GenBank format, and tabular

format. The reference transcriptome (whole mRNA of an organism) and proteins may also

be available as shown in Figure 2.1.

For the alignment/mapping of reads produced by sequencing instruments, we may

need to download a reference genome of the species from which the sequencing raw data

are taken. The sequence of the reference genome must be in the FASTA file format. For

example, to download the FASTA file of the human genome, copy the link behind the

“genome” hyperlink on the Genome database web page and, in a Linux terminal, use “wget”

to download the file to a directory of your choice (e.g., “refgenome”):

mkdir refgenome

wget \
  -O "refgenome/GRCh38.p13_ref.fna.gz" \
  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz

These commands create the “refgenome” directory and download the compressed FASTA

sequence of the human reference genome into it as “GRCh38.p13_ref.fna.gz”. The

compressed FASTA sequence file of this human genome release (GRCh38.p13) is only about

921 MB. We can decompress it from inside that directory using the “gunzip” command.

cd refgenome

gunzip GRCh38.p13_ref.fna.gz
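Before decompressing a file of this size, it can be worth confirming that the download is intact, and it is sometimes convenient to keep the compressed copy. The following is a minimal sketch under stated assumptions: it runs on a tiny stand-in archive created in a temporary directory rather than on the real genome, and the function name check_gzip and the demo file names are hypothetical. It relies only on the standard “gzip -t” (test integrity) and “gunzip -c” (write to standard output) options.

```shell
# check_gzip is a hypothetical helper: it succeeds only if the archive passes
# gzip's built-in integrity test.
check_gzip() {
    gzip -t "$1" 2>/dev/null
}

# Build a tiny stand-in archive (assumption: demo data, not the real genome).
demo_dir=$(mktemp -d)
printf '>seq1 demo sequence\nACGTACGT\n' | gzip > "$demo_dir/demo.fna.gz"

if check_gzip "$demo_dir/demo.fna.gz"; then
    # -c writes the decompressed data to standard output, so the .gz file is kept.
    gunzip -c "$demo_dir/demo.fna.gz" > "$demo_dir/demo.fna"
    echo "archive OK, decompressed copy written"
else
    echo "archive is damaged; download it again" >&2
fi
```

On the real data, the same two calls would be applied to “GRCh38.p13_ref.fna.gz”. Plain “gunzip”, as used in the text, instead replaces the archive with the decompressed file.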

The “gunzip” command decompresses the reference genome file to “GRCh38.p13_ref.fna”,

which is about 3.1 GB. A text file this large is best viewed with a pager such as the Linux

“less” command; “cat” prints the whole file at once, so it is practical only when its output is

piped to another command such as “head”. The reference sequences are in the FASTA file

format. The file contains several sequences representing genomic units such as chromosomes.

Each FASTA sequence entry consists of two parts: a definition line (defline), which

is a single line that begins with the “>” symbol, and a sequence, which may span several lines.

Figure 2.2 shows the beginning of the human genome reference sequence. Notice that the

defline includes the accession of the sequence (a RefSeq accession, since the file comes from

a GCF assembly), the species scientific name, the genome unit (chromosome number), and

the human genome build. A chromosome sequence may

begin with multiple ambiguous bases (Ns). In Figure 2.2, we removed several lines of Ns

intentionally to show the DNA nucleobases.
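The two-part structure of a FASTA entry described above can be sketched with a short awk script that reports, for each entry, the identifier taken from the defline and the total sequence length summed over its lines. This is only an illustration: the function name fasta_lengths and the toy file with its identifiers are assumptions, not part of the reference genome.

```shell
# fasta_lengths is a hypothetical helper: print "identifier<TAB>length" per entry.
fasta_lengths() {
    awk '
        /^>/ {                                 # defline: starts a new entry
            if (name != "") print name "\t" len
            name = substr($1, 2)               # first word without the ">"
            len = 0
            next
        }
        { len += length($0) }                  # sequence line: add its length
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Toy FASTA file (assumption: made-up identifiers and sequences).
demo_fasta=$(mktemp)
cat > "$demo_fasta" <<'EOF'
>demo_chr1 Example organism chromosome 1
NNNNNNNNNN
ACGTACGTAC
GGCC
>demo_chr2 Example organism chromosome 2
TTTTAAAA
EOF

fasta_lengths "$demo_fasta"
```

Ambiguous bases (Ns) are counted as part of each length, so leading runs of Ns contribute to the reported size of an entry.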

The following Unix/Linux commands work on text files in general, and here we

can use them with FASTA files to collect some useful information.

To display the FASTA file content page by page, use the “less” command:

less GRCh38.p13_ref.fna
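When only the beginning of a large FASTA file is of interest, a pager is not strictly needed. As a hedged sketch, “head” prints the first lines, and “grep -v” can hide lines that consist solely of Ns so that the first unambiguous bases become visible. The toy file below is an assumption standing in for the reference genome.

```shell
# Toy FASTA file (assumption: stands in for GRCh38.p13_ref.fna).
demo_fasta=$(mktemp)
printf '>demo_chr1 Example organism chromosome 1\nNNNNNNNN\nNNNNNNNN\nACGTNACG\nTTGGCCAA\n' > "$demo_fasta"

# Show only the first two lines of the file.
head -n 2 "$demo_fasta"

# Drop lines made up entirely of N, then show the first three remaining lines.
grep -Ev '^N+$' "$demo_fasta" | head -n 3
```

Lines that mix Ns with other bases are kept, because the pattern matches only lines containing nothing but N.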

To count the number of FASTA sequences in the FASTA file, use the “grep” command:

grep -c ">" GRCh38.p13_ref.fna
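Counting lines that contain “>” works as long as the symbol appears only in deflines; anchoring the pattern to the start of the line states that intent more precisely. The sketch below, run on a made-up toy file, counts the entries and then lists their identifiers. The file contents are assumptions for illustration.

```shell
# Toy FASTA file with three entries (assumption: made-up identifiers).
demo_fasta=$(mktemp)
printf '>demo_chr1 Example organism chromosome 1\nACGTACGT\n>demo_chr2 Example organism chromosome 2\nTTGGCCAA\n>demo_chrM Example organism mitochondrion\nGGGGCCCC\n' > "$demo_fasta"

# Count deflines: ^ anchors the match to the start of the line.
n_seqs=$(grep -c '^>' "$demo_fasta")
echo "sequences: $n_seqs"

# List identifiers only: keep deflines, strip the leading ">", keep the first word.
grep '^>' "$demo_fasta" | cut -c 2- | cut -d ' ' -f 1
```

The single quotes around the pattern matter: an unquoted “>” would be interpreted by the shell as output redirection.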